Comprehensions and Generators

What You’ll Learn

  • How comprehensions let you build lists, dicts, and sets in a single expression
  • The connection between comprehensions and mathematical set-builder notation
  • When comprehensions improve readability and when they hurt it
  • How generator expressions and generator functions handle large datasets without loading everything into memory
  • How to build data pipelines with yield

Concept

Declarative vs Imperative

So far, you have built lists by writing for loops that append items one at a time. That is the imperative style: you spell out how to construct the result step by step.

# Imperative: step-by-step instructions
sub_80 = []
for r in rounds:
    score = int(r['total_score'])
    if score < 80:
        sub_80.append(score)

Python also supports a declarative style where you describe what you want, and the language figures out how to build it:

# Declarative: describe the result
sub_80 = [int(r['total_score']) for r in rounds if int(r['total_score']) < 80]

Both produce the same list. The declarative version is called a comprehension.
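
To check that claim without the CSV files, here is a minimal sketch that runs both versions on a small hypothetical `sample_rounds` list. The scores are made up for illustration and are not taken from rounds.csv.

```python
# Hypothetical sample data standing in for rows from rounds.csv
sample_rounds = [
    {'total_score': '78'},
    {'total_score': '85'},
    {'total_score': '72'},
    {'total_score': '91'},
]

# Imperative version: build the list step by step
loop_result = []
for r in sample_rounds:
    score = int(r['total_score'])
    if score < 80:
        loop_result.append(score)

# Declarative version: describe the list in one expression
comp_result = [int(r['total_score']) for r in sample_rounds
               if int(r['total_score']) < 80]

print(loop_result)                 # [78, 72]
print(comp_result == loop_result)  # True
```

Note that the comprehension converts each score twice (once in the filter, once in the expression), which is the price of fitting everything into a single expression here.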

The Set-Builder Notation Parallel

If you have taken a math course, comprehensions will look familiar. In set-builder notation you write:

\[S = \{x^2 \mid x \in \mathbb{N},\; x < 5\}\]

which reads: “the set of \(x^2\) for every natural number \(x\) less than 5.” The result is \(\{0, 1, 4, 9, 16\}\).

Python’s comprehension syntax mirrors this directly:

s = {x**2 for x in range(5)}

The structure is always:

[expression  for variable in iterable  if condition]
 ^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^
 what you     where the data           optional
 want         comes from               filter
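
As a small sketch of all three parts working together, the example below adds a filter to the squares example. The "even numbers only" condition is an illustrative choice, not something from the golf data.

```python
# Set-builder: { x^2 | x in N, x < 10, x is even }
#   expression: x**2    iterable: range(10)    filter: x % 2 == 0
even_squares = {x**2 for x in range(10) if x % 2 == 0}
print(sorted(even_squares))   # [0, 4, 16, 36, 64]

# The same three parts inside square brackets produce a list instead of a set
even_squares_list = [x**2 for x in range(10) if x % 2 == 0]
print(even_squares_list)      # [0, 4, 16, 36, 64]
```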

Why Comprehensions Matter

  1. Concise. One line instead of five. Less code means fewer places for bugs to hide.
  2. Readable. Once you learn the pattern, comprehensions telegraph intent: “I am building a collection.”
  3. Pythonic. The Python community treats comprehensions as idiomatic style. You will encounter them constantly in libraries, documentation, and other people’s code.
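  4. Slightly faster. In CPython, a comprehension adds items with a dedicated bytecode instruction instead of looking up and calling list.append on every iteration, which removes some per-item overhead. The difference is usually small, but it is measurable.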
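
If you want to see the speed difference on your own machine, a rough sketch with the standard-library timeit module looks like this. The function names and the data are hypothetical, and the exact timings will vary by machine and Python version, so no expected numbers are shown.

```python
import timeit

data = list(range(10_000))  # made-up input, just something to iterate over


def build_with_loop():
    """Build the result with an explicit loop and append."""
    result = []
    for x in data:
        if x % 2 == 0:
            result.append(x * 2)
    return result


def build_with_comprehension():
    """Build the same result with a list comprehension."""
    return [x * 2 for x in data if x % 2 == 0]


# Run each version 200 times and report the total elapsed time
loop_time = timeit.timeit(build_with_loop, number=200)
comp_time = timeit.timeit(build_with_comprehension, number=200)

print(f'loop:          {loop_time:.3f} s')
print(f'comprehension: {comp_time:.3f} s')
```

Treat the result as a curiosity rather than a reason to choose one style; readability should still drive the decision.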

When NOT to Use Comprehensions

Comprehensions can become unreadable when they are too long or too deeply nested. A good rule of thumb:

  • One for clause with an optional if? Almost always fine as a comprehension.
  • Two for clauses? Fine if the logic is straightforward, but consider whether a regular loop is clearer.
  • Three or more for clauses, or complex conditional logic? Write a regular loop. You can always refactor later.

If you cannot read the comprehension aloud and understand it in one pass, it is too complex. Readability always wins.


Code

Setup: Loading the Golf Data

We will reuse the pattern from Topic 03: read each CSV into a list of dictionaries with csv.DictReader.

import csv


def read_csv(filepath):
    """Read a CSV file and return a list of dictionaries."""
    with open(filepath, 'r', newline='') as f:  # newline='' is recommended by the csv module docs
        return list(csv.DictReader(f))


players = read_csv('../../data/players.csv')
courses = read_csv('../../data/courses.csv')
holes = read_csv('../../data/holes.csv')
rounds = read_csv('../../data/rounds.csv')
shots = read_csv('../../data/shots.csv')

print(f'Players: {len(players)}')
print(f'Courses: {len(courses)}')
print(f'Holes:   {len(holes)}')
print(f'Rounds:  {len(rounds)}')
print(f'Shots:   {len(shots)}')

1. List Comprehensions

A list comprehension builds a new list by applying an expression to each item in an iterable, with an optional filter. Let’s start with the simplest case.

Basic transformation: scores to relative-to-par

Every course has a par (typically 72). Let’s convert raw scores into relative-to-par values so we can see how each round compared to par. We will use a par of 72 as a baseline.

First, the imperative way with a for loop:

# Imperative approach: for loop
par = 72
relative_scores = []
for r in rounds:
    score = int(r['total_score'])
    relative_scores.append(score - par)

print(relative_scores)

# Declarative approach: list comprehension
par = 72
relative_scores = [int(r['total_score']) - par for r in rounds]

print(relative_scores)

Both produce the same list. The comprehension says what we want (each score minus par) without spelling out the empty-list-and-append mechanics.

Adding a filter: sub-80 rounds

The if clause in a comprehension keeps only items that pass a condition.

# Imperative: filter with a for loop
sub_80 = []
for r in rounds:
    score = int(r['total_score'])
    if score < 80:
        sub_80.append(r)

for r in sub_80:
    print(f"Player {r['player_id']}: {r['total_score']} on {r['date']}")

# Declarative: list comprehension with filter
sub_80 = [r for r in rounds if int(r['total_score']) < 80]

for r in sub_80:
    print(f"Player {r['player_id']}: {r['total_score']} on {r['date']}")

Filtering shots: birdies only

A birdie happens when a player finishes a hole in one stroke fewer than par. We need to combine shot data with hole pars to find birdies. Let’s first build a par lookup, then count strokes per hole per round.

# Build a par lookup: (course_id, hole_number) -> par
par_lookup = {}
for h in holes:
    par_lookup[(h['course_id'], h['hole_number'])] = int(h['par'])

# Build a round-to-course lookup
round_course = {}
for r in rounds:
    round_course[r['round_id']] = r['course_id']

# Count strokes per (round_id, hole)
hole_scores = {}
for s in shots:
    key = (s['round_id'], s['hole'])
    hole_scores[key] = hole_scores.get(key, 0) + 1

# Now find birdies using a list comprehension
birdies = [
    (round_id, hole, strokes, par_lookup[(round_course[round_id], hole)])
    for (round_id, hole), strokes in hole_scores.items()
    if strokes == par_lookup[(round_course[round_id], hole)] - 1
]

print(f'Total birdies across all rounds: {len(birdies)}')
print()
for round_id, hole, strokes, par in birdies[:10]:
    print(f'  Round {round_id}, Hole {hole}: {strokes} strokes (par {par})')

Conditional expressions: scoring labels

A conditional expression (also called a ternary) lets you choose between two values inside the comprehension. This is different from the if filter – it does not exclude items, it picks which value to output.

Syntax: value_if_true if condition else value_if_false

# Label each round's score relative to par
par = 72
labeled_scores = [
    (int(r['total_score']),
     'under par' if int(r['total_score']) < par
     else 'even par' if int(r['total_score']) == par
     else 'over par')
    for r in rounds
]

for score, label in labeled_scores:
    print(f'  {score:>3d}  {label}')
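
Because the filter `if` and the conditional expression are easy to confuse, here is a minimal side-by-side sketch. The `scores` list is hypothetical and chosen only to keep the output short.

```python
scores = [70, 72, 75, 68]  # hypothetical round scores
par = 72

# Filter: an `if` at the END drops items that fail the test
under_par = [s for s in scores if s < par]

# Conditional expression: `if ... else` at the FRONT keeps every item
# and only decides which value to output for it
labels = ['under' if s < par else 'not under' for s in scores]

print(under_par)  # [70, 68]
print(labels)     # ['under', 'not under', 'not under', 'under']
```

The filtered list is shorter than the input, while the labeled list always has one entry per input item.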

We can apply the same idea to individual holes. In golf, the number of strokes relative to par has a specific name:

def scoring_name(strokes, par):
    """Return the golf term for a hole score relative to par."""
    diff = strokes - par
    names = {-3: 'albatross', -2: 'eagle', -1: 'birdie',
             0: 'par', 1: 'bogey', 2: 'double bogey', 3: 'triple bogey'}
    return names.get(diff, f'+{diff}' if diff > 0 else f'{diff}')


# Build scoring labels for every hole played
hole_labels = [
    scoring_name(strokes, par_lookup[(round_course[round_id], hole)])
    for (round_id, hole), strokes in hole_scores.items()
]

# Count how often each label appears
label_counts = {}
for label in hole_labels:
    label_counts[label] = label_counts.get(label, 0) + 1

print('Score distribution across all rounds:')
for label, count in sorted(label_counts.items(), key=lambda x: x[1], reverse=True):
    print(f'  {label:>15s}: {count}')

2. Dict Comprehensions

A dict comprehension builds a dictionary in one expression. The syntax is the same as a list comprehension but uses curly braces and a key: value pair:

{key_expr: value_expr for variable in iterable if condition}

Player lookup by ID

In Topic 03, we built lookup dictionaries with a for loop. Dict comprehensions make this a one-liner.

# Imperative: for loop
player_lookup = {}
for p in players:
    player_lookup[p['player_id']] = p['name']

print(player_lookup)

# Declarative: dict comprehension
player_lookup = {p['player_id']: p['name'] for p in players}

print(player_lookup)

Scoring stats per player

Let’s build a dictionary that maps each player’s name to their average score. This shows how a dict comprehension can include more complex expressions.

# Group scores by player_id first (using a regular loop)
scores_by_player = {}
for r in rounds:
    pid = r['player_id']
    if pid not in scores_by_player:
        scores_by_player[pid] = []
    scores_by_player[pid].append(int(r['total_score']))

# Now use a dict comprehension to calculate averages
avg_scores = {
    player_lookup[pid]: round(sum(scores) / len(scores), 1)
    for pid, scores in scores_by_player.items()
}

print('Average score per player:')
for name, avg in sorted(avg_scores.items(), key=lambda x: x[1]):
    print(f'  {name:20s} {avg:.1f}')
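
The grouping loop above checks whether the key exists before appending. One possible alternative, sketched here on hypothetical sample rows, is `collections.defaultdict`, which creates the empty list on first access. This is an optional variation, not the approach the lesson relies on.

```python
from collections import defaultdict

# Hypothetical rows standing in for rounds.csv
sample_rounds = [
    {'player_id': '1', 'total_score': '70'},
    {'player_id': '1', 'total_score': '74'},
    {'player_id': '4', 'total_score': '95'},
]

# defaultdict(list) supplies an empty list the first time a key is used
scores_by_player = defaultdict(list)
for r in sample_rounds:
    scores_by_player[r['player_id']].append(int(r['total_score']))

# The dict comprehension step is unchanged
avg_by_player = {
    pid: round(sum(scores) / len(scores), 1)
    for pid, scores in scores_by_player.items()
}
print(avg_by_player)  # {'1': 72.0, '4': 95.0}
```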

Club distance averages from shots.csv

For each club, what is the average distance of a shot? We calculate distance as start_distance_to_pin - end_distance_to_pin.

# Group distances by club
club_distances = {}
for s in shots:
    club = s['club']
    distance = float(s['start_distance_to_pin']) - float(s['end_distance_to_pin'])
    if distance > 0:  # keep only shots that finished closer to the pin (skips e.g. putts that roll past the hole)
        if club not in club_distances:
            club_distances[club] = []
        club_distances[club].append(distance)

# Dict comprehension: club -> average distance
avg_club_distance = {
    club: round(sum(dists) / len(dists), 1)
    for club, dists in club_distances.items()
}

print('Average distance by club:')
for club, avg_dist in sorted(avg_club_distance.items(), key=lambda x: x[1], reverse=True):
    print(f'  {club:>10s}: {avg_dist:>6.1f} yards')

3. Set Comprehensions

A set comprehension builds a set – an unordered collection of unique values. The syntax uses curly braces like a dict comprehension, but without the key: value pair:

{expression for variable in iterable if condition}

Unique clubs used

# All unique clubs in the dataset
all_clubs = {s['club'] for s in shots}
print(f'Unique clubs ({len(all_clubs)}):')
print(sorted(all_clubs))

Unique courses each player has played

# Build a course name lookup
course_lookup = {c['course_id']: c['name'] for c in courses}

# For each player, find unique courses played
for p in players:
    courses_played = {course_lookup[r['course_id']] for r in rounds if r['player_id'] == p['player_id']}
    print(f"{p['name']:20s} played at: {', '.join(sorted(courses_played))}")

Set operations: clubs one player uses that another does not

Sets support mathematical operations: union (|), intersection (&), and difference (-). These are powerful for comparing groups.

# Get rounds for Bear Woods (player_id=1) and Bobby Bogey (player_id=4)
bear_rounds = {r['round_id'] for r in rounds if r['player_id'] == '1'}
bobby_rounds = {r['round_id'] for r in rounds if r['player_id'] == '4'}

# Clubs used by each player
bear_clubs = {s['club'] for s in shots if s['round_id'] in bear_rounds}
bobby_clubs = {s['club'] for s in shots if s['round_id'] in bobby_rounds}

print(f"Bear Woods uses:  {sorted(bear_clubs)}")
print(f"Bobby Bogey uses: {sorted(bobby_clubs)}")
print()
print(f"Clubs Bear uses that Bobby does not: {sorted(bear_clubs - bobby_clubs)}")
print(f"Clubs Bobby uses that Bear does not: {sorted(bobby_clubs - bear_clubs)}")
print(f"Clubs both use:                      {sorted(bear_clubs & bobby_clubs)}")
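
The code above uses difference and intersection but not union. As a compact sketch of all three operators, here they are on two small hypothetical club sets (the club names are invented, not read from shots.csv).

```python
# Hypothetical club sets for two players
bear = {'Driver', '7 Iron', 'Putter', '3 Wood'}
bobby = {'Driver', '7 Iron', 'Putter', 'Hybrid'}

either = bear | bobby      # union: clubs used by at least one player
both = bear & bobby        # intersection: clubs used by both
only_bear = bear - bobby   # difference: clubs only Bear uses
only_bobby = bobby - bear  # difference is not symmetric

print(sorted(either))      # ['3 Wood', '7 Iron', 'Driver', 'Hybrid', 'Putter']
print(sorted(both))        # ['7 Iron', 'Driver', 'Putter']
print(sorted(only_bear))   # ['3 Wood']
print(sorted(only_bobby))  # ['Hybrid']
```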

4. Nested Comprehensions

You can nest for clauses inside a comprehension. The clauses read left to right, just like nested for loops read top to bottom.
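
A tiny sketch makes the ordering concrete. The inputs here are arbitrary toy values chosen so the output is easy to follow.

```python
# Comprehension: the first `for` is the outer loop, the second is the inner loop
pairs_comp = [(x, y) for x in 'ab' for y in range(2)]

# Equivalent nested loops, read top to bottom
pairs_loop = []
for x in 'ab':
    for y in range(2):
        pairs_loop.append((x, y))

print(pairs_comp)                # [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
print(pairs_comp == pairs_loop)  # True
```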

Flattening data

Suppose we want a flat list of every (player, course) combination that was played. With nested loops:

# Imperative: nested for loops
player_course_pairs = []
for p in players:
    for c in courses:
        # Check if this player has a round at this course
        played = any(
            r['player_id'] == p['player_id'] and r['course_id'] == c['course_id']
            for r in rounds
        )
        if played:
            player_course_pairs.append((p['name'], c['name']))

for name, course in player_course_pairs:
    print(f'  {name:20s} -> {course}')

# Declarative: nested comprehension
player_course_pairs = [
    (p['name'], c['name'])
    for p in players
    for c in courses
    if any(r['player_id'] == p['player_id'] and r['course_id'] == c['course_id']
           for r in rounds)
]

for name, course in player_course_pairs:
    print(f'  {name:20s} -> {course}')

Here is a simpler example of flattening. Suppose we have hole pars grouped by course and we want a single flat list:

# Group pars by course
pars_by_course = {}
for h in holes:
    cid = h['course_id']
    if cid not in pars_by_course:
        pars_by_course[cid] = []
    pars_by_course[cid].append(int(h['par']))

# Flatten with a nested comprehension
all_pars = [par for course_pars in pars_by_course.values() for par in course_pars]
print(f'Total hole records: {len(all_pars)}')
print(f'All pars: {all_pars}')

Readability warning. Two nested for clauses are the practical limit for comprehensions. Anything deeper should be a regular loop or broken into helper functions. Compare:

# This is hard to read -- do NOT write comprehensions like this
result = [
    f(x, y, z)
    for x in xs
    for y in ys
    if g(x, y)
    for z in zs
    if h(y, z)
]

When you reach this level of complexity, nested loops with clear variable names will be far more maintainable.
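
As a sketch of what that refactor might look like, the block below rewrites the dense comprehension as nested loops. The functions `f`, `g`, `h` and the lists `xs`, `ys`, `zs` are hypothetical stand-ins defined only so the example runs.

```python
# Hypothetical stand-ins so the example is runnable
xs, ys, zs = [1, 2], [1, 2, 3], [2, 3]


def f(x, y, z):
    return x * 100 + y * 10 + z


def g(x, y):
    return x < y


def h(y, z):
    return y == z


# The hard-to-read comprehension, kept only for comparison
dense = [f(x, y, z) for x in xs for y in ys if g(x, y) for z in zs if h(y, z)]

# The same logic as nested loops: each condition sits next to the loop it guards
result = []
for x in xs:
    for y in ys:
        if not g(x, y):
            continue
        for z in zs:
            if h(y, z):
                result.append(f(x, y, z))

print(result)           # [122, 133, 233]
print(result == dense)  # True
```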

5. Generator Expressions

A generator expression looks just like a list comprehension, but it uses parentheses instead of brackets:

# List comprehension -- builds the entire list in memory
scores = [int(r['total_score']) for r in rounds]

# Generator expression -- produces values one at a time
scores = (int(r['total_score']) for r in rounds)

The generator does not build a list in memory. Instead, it produces values lazily – one at a time, on demand. This matters when working with large datasets.

# List comprehension returns a list
scores_list = [int(r['total_score']) for r in rounds]
print(type(scores_list))
print(scores_list)

print()

# Generator expression returns a generator object
scores_gen = (int(r['total_score']) for r in rounds)
print(type(scores_gen))
print(scores_gen)  # no values shown -- they haven't been produced yet
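
One consequence of lazy production is that a generator can be consumed only once. The following minimal sketch, using a toy generator of squares, shows values being pulled on demand and the generator running dry.

```python
squares = (n * n for n in range(4))

first = next(squares)      # pull exactly one value on demand
remaining = list(squares)  # consume whatever is left
again = list(squares)      # nothing left: the generator is exhausted

print(first)      # 0
print(remaining)  # [1, 4, 9]
print(again)      # []
```

If you need to loop over the values more than once, build a list instead or create a fresh generator each time.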

Memory efficiency

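With our small golf dataset, the difference is negligible. But let’s demonstrate the concept with sys.getsizeof. Keep in mind that sys.getsizeof reports only the size of the container object itself (for a list, its array of references), not the objects it holds, so the list’s real footprint is larger than the number shown.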

import sys

# List comprehension: all values stored in memory
all_distances_list = [float(s['start_distance_to_pin']) for s in shots]
print(f'List: {sys.getsizeof(all_distances_list):>8,} bytes for {len(all_distances_list)} items')

# Generator expression: values produced on demand
all_distances_gen = (float(s['start_distance_to_pin']) for s in shots)
print(f'Generator: {sys.getsizeof(all_distances_gen):>5,} bytes (fixed overhead, regardless of data size)')
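
To see the gap widen with data size, here is a sketch that swaps the golf data for a hypothetical one-million-item range. Exact byte counts depend on the Python version and platform, so none are shown.

```python
import sys

n = 1_000_000  # arbitrary size chosen to make the difference obvious

as_list = [i * 2 for i in range(n)]  # every value is created and stored now
as_gen = (i * 2 for i in range(n))   # nothing is computed until requested

list_size = sys.getsizeof(as_list)   # container only, not the ints inside it
gen_size = sys.getsizeof(as_gen)

print(f'List:      {list_size:>12,} bytes')
print(f'Generator: {gen_size:>12,} bytes')
```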

Using generators with built-in functions

Generator expressions pair naturally with sum(), max(), min(), any(), and all(). These functions consume the generator one item at a time, so no intermediate list is ever built.

When a generator expression is the only argument in a function call, you can drop the extra parentheses:

sum(int(r['total_score']) for r in rounds)  # no double parentheses needed

# sum() with a generator
total_strokes = sum(int(r['total_score']) for r in rounds)
avg_score = total_strokes / len(rounds)
print(f'Total strokes across all rounds: {total_strokes}')
print(f'Average score: {avg_score:.1f}')

# max() and min() with generators
best_score = min(int(r['total_score']) for r in rounds)
worst_score = max(int(r['total_score']) for r in rounds)
print(f'Best score:  {best_score}')
print(f'Worst score: {worst_score}')

# any() -- did any round score under par (72)?
has_under_par = any(int(r['total_score']) < 72 for r in rounds)
print(f'Any round under par? {has_under_par}')

# all() -- did every round score under 110?
all_under_110 = all(int(r['total_score']) < 110 for r in rounds)
print(f'All rounds under 110? {all_under_110}')

# any() short-circuits: it stops as soon as it finds a True value
# all() short-circuits: it stops as soon as it finds a False value
# This means they are efficient even on huge datasets
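
Short-circuiting is easy to observe directly. In this sketch, a hypothetical helper `is_under_par` records every score it is asked about, so we can see where any() stops. The scores are made up.

```python
checked = []  # records which scores were actually examined


def is_under_par(score, par=72):
    """Hypothetical helper that logs each score it tests."""
    checked.append(score)
    return score < par


scores = [80, 75, 70, 90, 85]  # made-up scores

found = any(is_under_par(s) for s in scores)

print(found)    # True
print(checked)  # [80, 75, 70]
```

The last two scores are never examined, because any() stops pulling from the generator as soon as it sees the first True.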

6. Generator Functions with yield

A generator expression works for simple transformations. For more complex logic, you can write a generator function using the yield keyword.

A generator function looks like a regular function, but instead of returning a single result, it yields values one at a time. Each time the caller asks for the next value, the function resumes exactly where it left off.
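
The pause-and-resume behavior can be traced with a minimal sketch. The `countdown` function below is a hypothetical example, not part of the golf code.

```python
def countdown(n):
    """Yield n, n-1, ..., 1, printing when the body starts and finishes."""
    print('starting')
    while n > 0:
        yield n   # pause here and hand n to the caller
        n -= 1    # execution resumes on this line at the next request
    print('done')


gen = countdown(2)   # nothing runs yet: calling the function only creates the generator
first = next(gen)    # prints 'starting', runs to the first yield
second = next(gen)   # resumes after the yield, loops, yields again
rest = list(gen)     # resumes, the loop ends, prints 'done'

print(first, second, rest)  # 2 1 []
```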

Streaming a large file line by line

Our read_csv() function loads the entire file into memory. For a 2,100-row shots file that is fine, but imagine a file with millions of rows. A generator function can stream it row by row.

def stream_csv(filepath):
    """Yield rows from a CSV file one at a time, without loading the full file."""
    with open(filepath, 'r', newline='') as f:  # newline='' is recommended by the csv module docs
        reader = csv.DictReader(f)
        for row in reader:
            yield row


# Use it -- only one row is in memory at a time
row_count = 0
for row in stream_csv('../../data/shots.csv'):
    row_count += 1

print(f'Streamed {row_count} rows from shots.csv')

The key insight: the with open() block stays open while the generator is paused at yield. The file is read incrementally and closed automatically when the generator is exhausted, when its close() method is called, or when it is garbage collected. This means you can process a file larger than your available memory.
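
Here is a sketch of that cleanup behavior when the consumer stops early. It writes a small throwaway CSV with made-up rows, and the `events` list and `stream_rows` function are hypothetical names used only to record when cleanup happens.

```python
import csv
import os
import tempfile

events = []  # records when the generator's cleanup code runs


def stream_rows(filepath):
    """Yield CSV rows one at a time; log when the file-handling block is left."""
    with open(filepath, 'r', newline='') as f:
        try:
            for row in csv.DictReader(f):
                yield row
        finally:
            events.append('cleanup')  # runs on exhaustion or on close()


# Write a tiny temporary CSV with made-up data
with tempfile.NamedTemporaryFile('w', suffix='.csv', newline='',
                                 delete=False) as tmp:
    tmp.write('hole,club\n1,Driver\n1,Putter\n2,Driver\n')
    path = tmp.name

gen = stream_rows(path)
first = next(gen)        # the file is open and the generator is paused at yield
events_before = list(events)

gen.close()              # stop early: the finally block and the with-exit run now
events_after = list(events)

os.remove(path)          # remove the temporary file

print(first['club'])     # Driver
print(events_before)     # []
print(events_after)      # ['cleanup']
```

Calling close() explicitly is mainly useful when you abandon a generator before it is exhausted and do not want to wait for garbage collection.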

Building a data pipeline

Generator functions compose naturally into pipelines where each stage transforms or filters data before passing it to the next. Think of it as an assembly line: each station does one thing.

Let’s build a pipeline: read shots.csv, filter to a specific player, then calculate strokes gained stats.

# Stage 1: Stream raw shot data
def stream_shots(filepath):
    """Yield shot dictionaries from the CSV file."""
    with open(filepath, 'r', newline='') as f:  # newline='' is recommended by the csv module docs
        reader = csv.DictReader(f)
        for row in reader:
            yield row


# Stage 2: Filter to a specific player's rounds
def filter_by_player(shot_stream, player_id):
    """Yield only shots from rounds belonging to the given player."""
    # Note: this reads the module-level `rounds` list loaded in the setup code
    player_rounds = {r['round_id'] for r in rounds if r['player_id'] == player_id}
    for shot in shot_stream:
        if shot['round_id'] in player_rounds:
            yield shot


# Stage 3: Calculate strokes gained per club
def strokes_gained_by_club(shot_stream):
    """Consume a shot stream and return strokes gained totals per club."""
    sg_totals = {}
    sg_counts = {}
    for shot in shot_stream:
        club = shot['club']
        sg = float(shot['strokes_gained'])
        sg_totals[club] = sg_totals.get(club, 0.0) + sg
        sg_counts[club] = sg_counts.get(club, 0) + 1
    return {club: (sg_totals[club], sg_counts[club]) for club in sg_totals}


# Connect the pipeline for Bear Woods (player_id='1')
raw_shots = stream_shots('../../data/shots.csv')
bear_shots = filter_by_player(raw_shots, '1')
bear_sg = strokes_gained_by_club(bear_shots)

print('Bear Woods -- Strokes Gained by Club:')
print(f'{"Club":>10s}  {"Total SG":>10s}  {"Shots":>6s}  {"SG/Shot":>8s}')
print('-' * 40)
for club, (total, count) in sorted(bear_sg.items(), key=lambda x: x[1][0], reverse=True):
    print(f'{club:>10s}  {total:>10.2f}  {count:>6d}  {total/count:>8.3f}')

Notice how each pipeline stage does one job and can be tested on its own (with the caveat that filter_by_player reads the rounds list loaded earlier). You can swap stages, add new filters, or reuse them in different combinations. And because each stage yields one row at a time, the pipeline never holds the full shots file in memory.
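
To illustrate swapping and adding stages, here is a sketch with two small hypothetical stages, `filter_by_rounds` and `only_club`. Unlike the stage above, `filter_by_rounds` receives the round IDs as an explicit argument, which is one possible way to make a stage testable without any global data. The sample shots are invented.

```python
def filter_by_rounds(shot_stream, round_ids):
    """Hypothetical stage: yield shots whose round_id is in round_ids."""
    for shot in shot_stream:
        if shot['round_id'] in round_ids:
            yield shot


def only_club(shot_stream, club):
    """Hypothetical stage: yield shots hit with the given club."""
    for shot in shot_stream:
        if shot['club'] == club:
            yield shot


# Made-up shots standing in for rows streamed from shots.csv
sample_shots = [
    {'round_id': '1', 'club': 'Driver', 'strokes_gained': '0.20'},
    {'round_id': '1', 'club': 'Putter', 'strokes_gained': '-0.10'},
    {'round_id': '2', 'club': 'Driver', 'strokes_gained': '0.05'},
    {'round_id': '3', 'club': 'Driver', 'strokes_gained': '-0.30'},
]

# Chain the stages; nothing is processed until sum() starts pulling rows
stream = iter(sample_shots)
stream = filter_by_rounds(stream, {'1', '3'})
stream = only_club(stream, 'Driver')
total = sum(float(s['strokes_gained']) for s in stream)

print(f'{total:.2f}')  # -0.10
```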

Let’s reuse the same pipeline for Bobby Bogey to compare:

# Same pipeline, different player
raw_shots = stream_shots('../../data/shots.csv')
bobby_shots = filter_by_player(raw_shots, '4')
bobby_sg = strokes_gained_by_club(bobby_shots)

print('Bobby Bogey -- Strokes Gained by Club:')
print(f'{"Club":>10s}  {"Total SG":>10s}  {"Shots":>6s}  {"SG/Shot":>8s}')
print('-' * 40)
for club, (total, count) in sorted(bobby_sg.items(), key=lambda x: x[1][0], reverse=True):
    print(f'{club:>10s}  {total:>10.2f}  {count:>6d}  {total/count:>8.3f}')

Brief mention: infinite sequences

Because generators produce values lazily, they can represent infinite sequences. You would never build an infinite list, but an infinite generator is perfectly fine as long as the consumer stops at some point.

Here is a quick example outside our golf data: generating round IDs for a hypothetical unlimited tournament.

def round_ids(start=1):
    """Generate round IDs forever."""
    n = start
    while True:
        yield f'R{n:04d}'
        n += 1


# Take only the first 10 IDs
id_gen = round_ids()
first_ten = [next(id_gen) for _ in range(10)]
print(first_ten)

The generator never terminates on its own, but that is fine because the consumer (next() called 10 times) controls how many values to pull. This pattern is useful for assigning unique IDs, generating test data, or simulating ongoing processes.
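
Another common way to take a bounded slice of an infinite generator is the standard library's `itertools.islice`. The sketch below repeats the `round_ids` generator so that it runs on its own.

```python
from itertools import islice


def round_ids(start=1):
    """Generate round IDs forever (same generator as above)."""
    n = start
    while True:
        yield f'R{n:04d}'
        n += 1


# islice stops pulling after the requested number of items
first_three = list(islice(round_ids(), 3))
later_three = list(islice(round_ids(start=98), 3))

print(first_three)  # ['R0001', 'R0002', 'R0003']
print(later_three)  # ['R0098', 'R0099', 'R0100']
```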


AI

Exercise 1: Refactor a For Loop into a Comprehension

Give an AI assistant the following code and ask it to refactor the loop into a comprehension:

import csv

with open('../../data/rounds.csv', 'r') as f:
    reader = csv.DictReader(f)
    rounds_data = list(reader)

windy_scores = []
for r in rounds_data:
    if r['weather'] == 'windy':
        score = int(r['total_score'])
        windy_scores.append(score)

avg_windy = sum(windy_scores) / len(windy_scores)
print(f'Average score in windy conditions: {avg_windy:.1f}')

Prompt to use: “Refactor the windy_scores for loop into a list comprehension. Keep everything else the same.”

Evaluate the AI’s response:

  • Does the comprehension produce the same result as the original loop?
  • Is the comprehension more readable than the loop, or roughly equivalent?
  • Did the AI change anything it should not have (e.g., the file reading logic or the final calculation)?

# Paste the AI-generated code here and run it.
# Verify the output matches: the average score for windy rounds.

Exercise 2: Explain a Complex Nested Comprehension

Give an AI assistant the following comprehension and ask it to explain what it does:

sg_by_player_club = {
    player_lookup[pid]: {
        club: round(sum(sg) / len(sg), 3)
        for club, sg in club_sg.items()
    }
    for pid, club_sg in (
        (r['player_id'], {})
        for r in rounds
    )
}

Prompt to use: “Explain step by step what this nested comprehension does. What would the output look like? Are there any bugs?”

Evaluate the AI’s response:

  • Does the AI correctly identify that this code has a bug? (The inner dict club_sg is always empty, so the inner comprehension produces an empty dict for every player.)
  • Does the AI explain the intended purpose versus what the code actually does?
  • Does the AI suggest a corrected version?

This exercise tests whether you can use AI to debug code, not just generate it. A good AI response should catch the logical error.

# Paste the AI's explanation and corrected code here.
# Try running the corrected version to verify it works.

Exercise 3: When to Use a Generator vs a List Comprehension

Prompt to use: “When should I use a generator expression instead of a list comprehension in Python? Give concrete examples of when each is the better choice.”

Evaluate the AI’s response. A good answer should mention:

  1. Memory efficiency – generators do not store all values in memory, making them better for large datasets.
  2. Lazy evaluation – generators produce values on demand, which means they can short-circuit (e.g., with any() or all()).
  3. Large or streaming data – when reading files line by line, processing API responses, or working with datasets that do not fit in memory.
  4. When you need to iterate only once – generators can only be consumed once. If you need to loop over the data multiple times, a list is required.
  5. When you need indexing or len() – lists support data[3] and len(data), generators do not.

If the AI misses any of these points, that is worth noting. Does it give practical examples, or only abstract explanations?

# Paste the AI's response here as a comment or markdown.
# Note which of the 5 evaluation points the AI covered.

Summary

| Concept | Syntax | When to Use |
|---|---|---|
| List comprehension | [expr for x in iterable if cond] | Building a new list by transforming and/or filtering an iterable |
| Dict comprehension | {k: v for x in iterable if cond} | Building a lookup dictionary or aggregating data by key |
| Set comprehension | {expr for x in iterable if cond} | Collecting unique values; then use set operations (-, &, \|) to compare groups |
| Nested comprehension | Multiple for clauses in one expression | Flattening nested data; limit to two for clauses for readability |
| Conditional expression | a if cond else b inside a comprehension | Choosing between two output values (not filtering) |
| Generator expression | (expr for x in iterable if cond) | Feeding data into sum(), max(), min(), any(), all() without building an intermediate list |
| Generator function | def f(): ... yield val | Streaming large files, building multi-stage data pipelines, producing infinite sequences |

Next up: Topic 05 – Classes and Data Modeling.
